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Abstract 

We propose a numerical technique for parameter inference in Markov models 
of biological processes. Based on time-series data of a process we estimate the 
kinetic rate constants by maximizing the likelihood of the data. The computation 
of the likelihood relies on a dynamic abstraction of the discrete state space of the 
Markov model which successfully mitigates the problem of state space largeness. 
We compare two variants of our method to state-of-the-art, recently published 
methods and demonstrate their usefulness and efficiency on several case studies 
from systems biology. 

1 Introduction 

A widely-used strategy in systems biology research is to refine mathematical models of 
biological processes based on both computer simulations and wet-lab experiments. In 
this context, parameter estimation methods for quantitative models play a major role. 
Typically, time series data is analyzed to learn the structure of a biochemical reaction 
network and to calibrate the reaction rate parameters. Direct measurement of param- 
eters through wet-lab experiments is often difficult or even impracticable. There are 
extensive research efforts to estimate the reaction rate parameters of ordinary differen- 
tial equations (ODEs) that describe the evolution of the chemical concentrations over 
time (see, for instance, J5] 5] [T) and the references therein). The problem of finding 
parameters that minimize the difference between observed and predicted data is usu- 
ally multimodal due to non-linear constraints and thus requires global optimization 
techniques. 

The assumption that chemical concentrations change deterministically and continu- 
ously in time is not always appropriate for biological processes. In particular, if certain 
substances in the cell are present in small concentrations the resulting stochastic effects 
cannot be adequately described by deterministic models. In that case, discrete-state 
stochastic models are advantageous because they take into account the discrete random 
nature of chemical reactions. The theory of stochastic chemical kinetics provides a 
rigorously justified framework for the description of chemical reactions where the ef- 
fects of molecular noise are taken into account J6). It is based on discrete-state Markov 
processes that explicitly represent the reactions as state -transitions between population 
vectors. When the molecule numbers are large, the solution of the ODE description 
of a reaction network and the mean of the corresponding stochastic model agree up to 
a small approximation error. If, however, small populations are involved, then only a 
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stochastic description can provide probabilities of events of interest such as probabil- 
ities of switching between different expression states in gene regulatory networks or 
the distribution of gene expression products. Moreover, even the mean behavior of the 
stochastic model can largely deviate from the behavior of the deterministic model lfl2l . 
In such cases the parameters of the stochastic model rather then the parameters of the 
deterministic model have to be estimated IfTTl [T5l [19l . 

Here, we consider noisy time series measurements of the system state as they are 
available from wet-lab experiments. Recent experimental imaging techniques such 
as high-resolution fluorescence microscopy can measure small molecule counts with 
measurement errors of less than one molecule [7]. We assume that the structure of 
the underlying reaction network is known but the rate parameters of the network are 
unknown. Then we identify those parameters that maximize the likelihood of the time 
series data. Maximum likelihood estimators are the most popular estimators since they 
have desirable mathematical properties. Specifically, they become minimum variance 
unbiased estimators and are asymptotically normal as the sample size increases. 

Our main contribution consists in devising an efficient algorithm for the numerical 
approximation of the likelihood and its derivatives w.r.t. the reaction rate constants. 
Previous techniques are based on Monte-Carlo sampling IfTTl [191 because the discrete 
state space of the underlying model is typically infinite in several dimensions and a 
priori a reasonable truncation of the state space is not availabe. Our method is not 
based on sampling but directly calculates the likelihood using a dynamic truncation of 
the state space. More precisely, we first show that the computation of the likelihood is 
equivalent to the evaluation of a product of vectors and matrices. This product includes 
the transition probability matrix of the associated continuous-time Markov process, i.e., 
the solution of the Kolmogorov differential equations (KDEs). Since solving the KDEs 
is infeasible, we propose two iterative approximation algorithms during which the state 
space is truncated in an on-the-fly fashion, that is, during a certain time interval we 
consider only those states that significantly contribute to the likelihood. One approach 
exploits equidistant observation intervals while the other approach is particularly well 
suited for observation intervals that are not equidistant. Both approaches take into 
account measurement noise during the observations. 

After introducing the stochastic model in Section|2] we discuss dynamic state space 
truncations for the transient probability distribution and its derivatives in Section[3] We 
introduce the maximum likelihood method in Section [4] and present the approxima- 
tion methods in Section [5] Finally, we report on experimental results for two reaction 
networks (Section |6]l and discuss related work in Section [7] 

2 Discrete-state Stochastic Model 

According to Gillespie's theory of stochastic chemical kinetics, a well-stirred mixture 
of n molecular species in a volume with fixed size and fixed temperature can be rep- 
resented as a continuous-time Markov chain {X(£),t > 0} 10. The random vector 
X(i) = (Xi(t), . . . ,X n (t)) describes the chemical populations at time t, i.e., Xi(t) 
is the number of molecules of type i € { 1 , . . . , n} at time t. Thus, the state space of 
X is Z™ = {0, 1, . . .}". The state changes of X are triggered by the occurrences of 
chemical reactions, which are of m different types. For j € { 1 , . . . , m} let v_,- <G U 1 be 
the nonzero change vector of the j-th reaction type, that is, v 3 - = vj + where v~ 
contains only non-positive entries, which specify how many molecules of each species 
are consumed (reactants) if an instance of the reaction occurs. The vector v+ con- 
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tains only non-negative entries, which specify how many molecules of each species 
are produced (products). Thus, if X(t) = x for some x € Z" with x + v J being 
non-negative, then X(t + dt) = x + Vj is the state of the system after the occurrence 
of the j-th reaction within the infinitesimal time interval [t, t + dt). 

Each reaction type has an associated propensity function, denoted by ai, . . . , ce m , 
which is such that ay (x) ■ dt is the probability that, given X(i) = x, one instance of the 
j-th reaction occurs within [t, t + dt). The value <x,-(x) is proportional to the number 
of distinct reactant combinations in state x. More precisely, if x = {x\, . . . , x n ) is a 
state for which x + is nonnegative then, for reactions with at most two reactants, 
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if VT =(0,...,0), 



a,-(x) = { 1 - f J - (1) 

c ..(x i)=c ..2 i i^ll ifv7= _ 2 . e . 

where i ^ £, Cj > 0, and is the vector with the i-th entry 1 and all other entries 0. 

Example 1 We consider the simple gene expression model described in 4751/ that in- 
volves three chemical species, namely DNAon, DNAoff, and mRNA, which are repre- 
sented by the random variables X\ (t), X2 (t), and X3 (t), respectively. The three pos- 
sible reactions are DNA n -> DNA ff, DNA ff — >• DNA n, and DNAon -> DNA n+ 
mRNA. Thus, Vj" = (-1,0,0), v+ = (0,1,0), = (0,-1,0), v+ = (1,0,0), 
V3" = (— 1, 0, 0) and V3 = (1,0,1). For a state x = (xi,X2,%3), the propensity 
functions are ai(x) = C\ ■ X\, Q.% (x) = ci ■ X2, and a 3 (x) - C3 ■ x\. Note that given 
the initial state x = (1, 0, 0), at any time, either the DNA is active or not, i.e. x\ = 
and X2 = 1, or x\ = \ and X2 = 0. Moreover, the state space of the model is infinite in 
the third dimension. For a fixed time instant t > 0, no upper bound on the number of 
mRNA is known a priori. All states x with X3 € Z + have positive probability ift>Q 
but these probabilities will tend to zero as £3 — > 00. 

In general, the reaction rate constants Cj refer to the probability that a randomly 
selected pair of reactants collides and undergoes the 7-th chemical reaction. It depends 
on the volume and the temperature of the system as well as on the microphysical prop- 
erties of the reactant species. Since reactions of higher order (requiring more than two 
reactants) are usually the result of several successive lower order reactions, we do not 
consider the case of more than two reactants. 

The Chemical Master Equation. For x e Z™ and t > 0, let p(x, t) denote the 
probability Pr(X(t) = x) and let p(i) be the row vector with entries p(x, t). 

Given vj~, . . . , v^, vf, . . . , v,+ , oi\, . . . , a m , and some initial distribution p(0), 
the Markov chain X is uniquely specified and its evolution is given by the chemical 
master equation (CME) 

£p(t) = p(t)Q, (2) 

where Q is the infinitesimal generator matrix of X with Q(x, y) = ctj (x) if y = x+ vj 
and x + vj > 0. Note that, in order to simplify our presentation, we assume here that 
all vectors Vj are distinct. All remaining entries of Q are zero except for the diagonal 
entries which are equal to the negative row sum. The ordinary first-order differential 
equation in (f2| is a direct consequence of the Kolmogorov forward equation. Since X 
is a regular Markov process, (f2) has the general solution p(t) = p(0) ■ e^*, where e A is 
the matrix exponential of a matrix A. If the state space of X is infinite, then we can only 
compute approximations of p(<). But even if Q is finite, its size is often large because it 
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grows exponentially with the number of state variables. Therefore standard numerical 
solution techniques for systems of first-order linear equations of the form of (f2]i are 
infeasible. The reason is that the number of nonzero entries in Q often exceeds the 
available memory capacity for systems of realistic size. If the populations of all species 
remain small (at most a few hundreds) then the CME can be efficiently approximated 
using projection methods ||9][14][3] or fast uniformization methods |[T3l[T6l . The idea 
of these methods is to avoid an exhaustive state space exploration and, depending on a 
certain time interval, restrict the analysis of the system to a subset of states. 

Here, we are interested in the partial derivatives of p(t) w.r.t. the reaction rate 
constants c = (ci, . . . , c m ). In order to explicitly indicate the dependence of p(t) 
on the vector c we write p(c,t) instead of p(t) and p(x,c,t) instead of p(x, i) if 
necessary. We define the row vectors (c, t) as the derivative of p(c, t) w.r.t. Cj, i.e., 

„ .(„ +\ _ dp(c,t) _ p(c+Acj,f)-p(c,f) 
SjK '*) - dc 3 - llm Ac-*0 Ac ' 

where the vector Acj is zero everywhere except for the j-th position that is equal to 
Ac. We denote the entry in Sj (c, t) that corresponds to state x by sj (x, c, t). Using ©, 
we find that Sj (c, t) is the unique solution of the system of ODEs 

f t s 3 (c, t) = Sj (c, t)Q + p(c, t)-£-Q, (3) 

where j e {1, . . . , m}. The initial condition is Sj(x, c, 0) = for all x and c since 
p(x, c, 0) is independent of cj. 



3 Dynamic state space truncation 

The parameter estimation method that we propose in Section ISTTI builds on the approx- 
imation of the transient distribution p(t) and the derivatives Sj (c, t) for all j at a fixed 
time instant t > 0. Therefore we now discuss how to solve and Q simultaneously 
using an explicit fourth-order Runge-Kutta method and a dynamically truncated state 
space. This truncation is necessary because models of chemical reaction networks typ- 
ically have a very large or infinite number of states x with nonzero values for p(x, t) 
and s 3 -(x, c, t). For instance, the system in Example [T]is infinite in one dimension. In 
order to keep the number of states, that are considered in a certain step of the numerical 
integration, manageable we suggest a dynamic truncation of the state space that, for a 
given time interval, neglects those states being not relevant during that time, that is, we 
neglect states that have a probability that is smaller than a certain threshold. 

First, we remark that the equation that corresponds to state x in (f2]i is given by 

^p(x, t) = Ej:x-v- >0 a 3 ( X ^ V J M X - V J > *) - a j ( X M X , *)• ( 4 ) 

and it describes the change of the probability of state x as the difference between 
inflow of probability at rate <x, (x — vj ) from direct predecessors x — Vj and outflow 
of probability at rate dj(x). Assume now that an initial distribution p(0) is given. We 
choose a small positive constant 5 and define the set of significant states S = {x | 
p(x, 0) > 8}. We then only integrate equations in (O and (0 that belong to states in 
S. If h is the time step of the numerical integration, then for the interval [t, t + h) we 
use the following strategy to modify S according to the probability flow. We check 
for all successors x + Vj ^ S of a state x € S whether p(x + Vj, t + h) becomes 
greater than 5 at time t + h as they receive "inflow" from their direct predecessors 
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(see Eq. ©). If the probability that x + Vj receives is greater 6, then we add x + Vj 
to S. Note that since we use a fourth-order method, states reachable within at most 
four transitions from a state in S can be added during one step of the integration. On 
the other hand, whenever p(x, t) becomes less or equal to 6 for a state x € S then 
we remove x from S. We approximate the probabilities and derivatives of all states 
that are not considered during [t, t + h) with zero. In this way the computational 
costs of the numerical integration is drastically reduced, since typically the number of 
states with probabilities less than 8 is large and the main part of the probability mass 
is concentrated on a small number of significant states. Due to the regular structure 
of X, the probability of a state decreases exponentially with its distance to the "high 
probability" locations. If S is small (e.g. 10~ 15 ) and the initial distribution is such 
that the main part of the probability mass (e.g. 99.99%) distributes on a manageable 
number of states, then even for long time horizons the approximation of the transient 
distribution is accurate. For arbitrary Markov models, the approximation error of the 
derivatives could, in principle, be large. For biochemical reaction networks, however, 
the underlying Markov process is well-structured and the sensitivity of the transient 
distribution w.r.t. the rate constants is comparatively small, i.e., small changes of the 
rate constants result in a transient distribution that differs only slightly from the original 
distribution. Therefore, the derivatives of insignificant states are small and, in order 
to calibrate parameters, it is sufficient to consider the derivatives of probabilities of 
significant states. It is impossible to explore the whole state space and those parts 
containing most of the probability mass are most informative w.r.t. pertubations of the 
rate constants. 

Example 2 We consider a simple enzyme reaction with three reactions that involve 
four different species, namely enzymes (E), substrates (S), complex molecules (C), 
and proteins (P). The reactions are complex formation (E+S—> C), dissociation of 
the complex (C^-E+S), and protein production (C^E+P). The corresponding rate 
functions are ai(x) = C\ ■ X\ ■ x%, a2(x) = ci ■ £3, and 023 (x) = C3 • X3 where 
x = (xi, x%, £3, Xa). The change vectors are given by = (— 1, — 1, 0, 0), = 
(0,0,1,0), v 2 - = (0,0,-1,0), v+ = (1,1,0,0), V3 = (0,0,-1,0), andv+ = 
(1, 0, 0, 1). We start initially with probability one in state x = (1000, 200, 0, 0) and 
compute p(t) and Sj(c,t) for t = 10, c = (1,1,0.1), and j € {1,2,3}. In Table\J\ 
we list the results of the approximation o/p(t) and Sj(c, t). We chose this model be- 
cause it has a finite state space and we can compare our approximation with the values 
obtained for 5 = 0. The column "Time" lists the running times of the computation. 
Obviously, the smaller S the more time consuming is the computation. The remaining 
columns refer to the maximum absolute error of all entries in the vectors p(t) and 
Sj(c, t) where we use as exact values those obtained by setting (5 = 0. Clearly, even if 
S = we have an approximation error due to the numerical integration of (|2|l and Q, 
which is, however, very small compared to the error that originates from the truncation 
of the state space. 

A similar truncation effect can be obtained by sorting the entries of p(t) and succes- 
sively removing the smallest entries until a fixed amount e of probability mass is lost. 
If e is chosen proportional to the time step, then it is possible to bound the total ap- 
proximation error of the probabilities, i.e., e = eh/t where e is the total approximation 
error for a time horizon of length t. If memory requirements and running time are 
more pressing then accuracy, then we can adjust the computational costs of the ap- 
proximation by keeping only the k most probable states in each step for some integer 
k. 
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Table 1: Approximated transient distribution and derivatives of the enzyme reaction 
network. 
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4 Parameter Inference 

Following the notation in lfl5l . we assume that observations of a biochemical network 
are made at time instances ti,...,tn € M>o where t\ < ... < t^. Moreover, we 
assume that Oi(ti) is the observed number of species i at time ti for i € {1, . . . , n} 
and £ £ {1, . . . , R}. Let 0(t() = (Oi(tg), . . . , O n (ti)) be the corresponding vector 
of observations. Since these observations are typically subject to measurement errors, 
we assume that Oi(tg) = Xi(ti) + ti{ti) where the error terms e»(^) are independent 
and identically normally distributed with mean zero and standard deviation a. Note 
that Xi(tt) is the true population of the z-th species at time tg. Clearly, this implies 
that, conditional on Xi(ti), the random variable Oi(tg) is independent of all other 
observations as well as independent of the history of X before time tg. 

We assume further that for the unobserved process X we do not know the values of 
the rate constants c%, . . . , c m and our aim is to estimate these constants. Similarly, the 
exact standard deviation a of the error terms is unknown and must be estimated^. Let / 
denote the joint density of 0(t±), . . . , 0(4r). Then the likelihood of the observations 
is OH 

C = /(0(ti),...,0(t fl )) 

= Ex 1 ---Ex R /(0(ii),...,0(ifl)|X(t 1 ) = x 1) ...,X(tfi) = X fi) (5) 
i¥(X(t 1 ) = x 1> ... J X(t fl ) = x fl ) > 

that is, C is the probability to observe 0{t\), . . . , 0(tn). Note that C depends on the 
chosen rate parameters c since the probability measure Pr(-) does. Furthermore, £ 
depends on a since the density / does. When necessary, we will make this dependence 
explicit by writing C(c, a) instead of C. We now seek constants c* and a standard 
deviation a* such that 

C(c* , a*) = max £(c, a) (6) 

cr,c 

where the maximum is taken over all a > and vectors c with all components 
strictly positive. This optimization problem is known as the maximum likelihood prob- 
lem ifTTI . Note that c* and a* are random variables because they depend on the (ran- 
dom) observations O(ti), . . . , 0(tn). 

If more than one sequence of observations is made, then the corresponding like- 
lihood is the product of the likelihoods of all individual sequences. More precisely, 
if O k (ti) is the fc-th observation that has been observed at time instant t\ where k € 

'We remark that it is straightforward to extend the estimation framework that we present in the sequel 
such that a covariance matrix for a multivariate normal distribution of the error terms is estimated. In this 
way, different measurement errors of the species can be taken into account as well as dependencies between 
error terms. 
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{1, . . . , K}, then we define £fc(c, cr) as the probability to observe O k (ti), . . . , O k (t R ) 
and maximize 

nf=iA(c,a). (7) 

In the sequel, we concentrate on expressions for £fe(c, cr) and ^-£fc(c, cr). We first 
assume A' = 1 and drop index fc. We consider the case K > 1 later. In (|5]l we 
sum over all state sequences xi, . . . , x# such that Pr(K(t() = xg, 1 < I < R) > 0. 
Since X has a large or even infinite state space, it is computationally infeasible to 
explore all possible sequences. In SectionQwe propose an algorithm to approximate 
the likelihoods and their derivatives. We truncate the state space in a similar way as in 
Section[3]and use the fact that (|5]l can be written as a product of vectors and matrices. 
Let <j) a be the density of the normal distribution with mean zero and standard deviation 
<t. Then 

/ (O(ii), . . . , 0(t R ) | X(ti) = xi, . . . , X(t fl ) = x fl ) 

= TI?=iTl'L f mu) \ x.^) = Xll ) 

= nf=i rnu ^(ac^) - x«), 

where x^ = (.t w , . . . , If we write w(x e ) for ]T™ = i <Pa(Oi(t e ) - xu), then the 
sequence xi, . . . , x# has weight Ylt=i w i x i) an d, thus, 

R 

£ = ^...^ft(X(i 1 )=x 1 ,...,X(f fl )=x Ji )I]«'(x f ). (8) 

Moreover, for the probability of the sequence xi , . . . , x# we have 

Pr(X(ti) =x 1 ,...,X(i R ) = x_r) =p(xi,ii)P 2 (xi,x 2 )---PR(xi{_i,Xi{) 
where P^(x, y) = Pr(X(te) = y | X(t^_i) = x). Hence, ([8]) can be written as 

C = p (xi , 1 1 ) w (xi ) P 2 (xi , x 2 ) w (x 2 ) . . .^Pij(xfl_i,Xii)io(xfl). (9) 

xi x 2 x_n 

Let Pf be the matrix with entries P^(x, y) for all states x, y. Note that P^ is the 
transition probability matrix of X for time step tt — ti-\ and thus the general solution 
gQiU-te-!) Q f tne Kolmogorov forward and backward differential equations 

&Pt = QPt, &Pt=PtQ. 

Using p(ti) = p(£o)Pl with = 0, we can write (O in matrix-vector form as 

C = p(h)P 1 W 1 P 2 W 2 ■ ■ ■ PrWrb. (10) 

Here, e is the vector with all entries equal to one and Wi is a diagonal matrix whose 
diagonal entries are all equal to w(xe) with I G {1, . . . , R}, where We is of the same 
size as Pg. Since it is in general not possible to analytically obtain parameters that 
maximize C, we use optimization techniques to find c* and a*. Typically, such tech- 
niques iterate over values of c and a and increase the likelihood £(c, cr) by following 
the gradient. Therefore, we need to calculate the derivatives j^-£ and -§^C- For g§-£ 
we obtain 



= ■£r{p(t )P l W 1 P 2 W 2 ---P R W R e) 

33 (11) 

= p(*o) (Eti (£;Pi) W e Ui^e Pt'Wv 
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The derivative of £ w.r.t. the standard deviation a is derived analogously. The only 
difference is that Pi, ... , Pr are independent of a but W\ , . . . , Wr depend on a. It 
is also important to note that expressions for partial derivatives of second order can be 
derived in a similar way. These derivatives can then be used for an efficient gradient- 
based local optimization. 

For K > 1 observation sequences we can maximize the log-likelihood 

iogIlE=iA=E£=ii°g4k, (12) 

instead of the likelihood in O, where we abbreviate £fc(c, a) by £&. Note that the 
derivatives are then given by 

£ E f =1 i og £ fe = £f =1 4^, (B) 

where A is either Cj or a. It is also important to note that only the weights u>(x^) 
depend on k, that is, on the observed sequence O k (ti), . . . , O k (tn). Thus, when we 
compute Ck based on ( TTOb we use for all k the same transition matrices Pi, ... , Pr 
and the same initial conditions p(to), but possibly different matrices Wi, . . . , Wr. 

5 Numerical approximation algorithm 

In this section, we focus on the numerical approximation of the likelihood and the 
corresponding derivatives w.r.t. the rate constants c%, . . . , c m . We propose two approx- 
imation algorithms for the likelihood and its derivatives, a state-based likelihood ap- 
proximation (SLA) and a path-based likelihood approximation (PLA). Both are based 
on a dynamic truncation of the state space as suggested in Section [3] They differ in 
that the PLA method exploits equidistant time series, that is, it is particularly efficient 
if h = ti + i — tg for all £ and if a is not too large. The SLA algorithm works for 
arbitrarily spaced time series and is efficient even if a is large. 

5.1 State-based likelihood approximation 

The SLA algorithm calculates an approximation of the likelihood based on (TTOb by 
traversing the matrix- vector product from the left to the right. The main idea behind 
the algorithm is that instead of explicitly computing the matrices Pi, we express the 
vector-matrix product u(ti-i)Pg as a system of ODEs similar to the CME (cf. Eq. (0). 
Here, u(<o), ■ • ■ , u(i^) are row vectors obtained during the iteration over time points 
to, . . . ,tu, that is, we define £ recursively as £ = u(t^)e with u(to) = p(io) and 

u(ti) = u(t^i)P e W e for all 1 < I < R, 

where to = 0. Instead of computing P( explicitly, we solve R systems of ODEs 

f t u(t) = u(t)Q (14) 

with initial condition u(^_i) = u(^_i) for the time interval [tt-i,tt) where £ £ 
{1, . . . , R}. After solving the ^-th system of ODEs we set u(t^) = u(i^) Wt and finally 
compute £ = u(f #)e. Since this is the same as solving the CME for different initial 
conditions, we can use the dynamic truncation of the state space proposed in Section|3] 
Since the vectors u(t() do not sum up to one, we scale all entries by multiplication 
with l/(u(^)e). This simplifies the truncation of the state space using the significance 



threshold 6 since after scaling it can be interpreted as a probability. In order to obtain 
the correct (unsealed) likelihood, we compute C as £ = n^Li("(^) e )- We handle 
the derivatives of £ in a similar way. To shorten our presentation, we only consider 
the derivative -£r-C in the sequel. An iterative scheme for -^C is derived analogously. 

From ( fTTT i we obtain -£r-C = Uj(tn)e with Uj(to) = and 

u 3 (t e ) = (uj(jbt-i)Pt + n(ti-i)-^Pt)W t far all 1 <£<R, 

where is the vector with all entries zero. Thus, during the solution of the £-th ODE 
in dT4b we simultaneously solve 

£ t u j (t)=u j (t)Q + u(t)£:Q (15) 

with initial condition Uj(^_i) = Uj(^_i) for the time interval [ti-i,te). As above, 
we set Uj(t() = Uj(ti)Wi and obtain -£r-C as Uj(tft)e. 

Solving (TBI and (fTBT l simultaneously is equivalent to the computation of the partial 
derivatives in © with different initial conditions. Thus, we can use the approximation 
algorithm proposed in Section [3] to approximate Uj(ti). Experimental results of the 
finite enzyme reaction network (see Example |2]i show that the approximation errors of 
the likelihood and its derivatives are of the same order of magnitude as those of the 
transient probabilities and their derivatives (not shown). Note, however, that, if a is 
small only few states contribute significantly to the likelihood. In this case, truncation 
strategies based on sorting of vectors are more efficient without considerable accuracy 
losses since the main part of the likelihood concentrates on very few entries (namely 
those that correspond to states that are close to the observed populations). 

In the case of K observation sequences we repeat the above algorithm in order to 
sequentially compute Ck for k £ {1, . . . , K}. We exploit (TT2l i and dT3l l to compute the 
total log-likelihood and its derivatives as a sum of individual terms. Obviously, it is 
possible to parallelize the SLA algorithm by computing Ck in parallel for all k. 

5.2 Path-based likelihood approximation 

If At = ti — ti-\ for all £ then the matrices Pi, ... , Pr in ( TTOb are equal to the Af -step 
transition matrix T(At) with entries Pr(X(t + At) = y | X(t) = x). Note that since 
we consider a time-homogeneous Markov process X, the matrix T(At) is independent 
of t. The main idea of the PLA method is to iteratively compute those parts of T(At) 
that correspond to state sequences (paths) xi , . . . , Xr that contribute significantly to 
C. The algorithm can be summarized as follows, where we omit the argument At of T 
to improve the readability and refer to the entries of T as T(x, y): 

1. We compute the transient distribution p(ii) and its derivatives (w.r.t. c and a) as 
outlined in Section fusing a significance threshold S. 

2. For each state xi with significant probability p(xi ,ti) we approximate the rows 
of T and -£r-T that correspond to xi based on a transient analysis for At time 
units. More precisely, if e Xl is the vector with all entries zero except for the entry 
that corresponds to state xi which is one, then we solve (fJI with initial condition 
e Xl for At time units in order to approximate T(xi,X2) and ^-T(xi,X2) for 
all X2. During this transient analysis we again apply the dynamic truncation of 
the state space proposed in Section|3]with threshold 6. 
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3. We then store for each pair (xi,X2) the (partial) likelihood a(xi,X2) and its 
derivatives: 

a(xi,x 2 ) = p(xi,*i) • to(xi) • T(xi,x 2 ) • u>(x 2 ) 
^-a(xi,x 2 ) = ^-j>(xi,ii) ■ «;(xi) ■ T(xi,x 2 ) • to(x 2 ) 

+p(xi,ti) • iu(xi) • ^-T(xi,x 2 ) • w(x 2 ). 

4. We reduce the number of considered pairs by sorting a(xi,x 2 ) for all pairs 
(xi,x 2 ) calculated in the previous step and keep the most probable pairs (see 
also Section©. 

5. Next, we repeat steps 2-4, where in step 2 we start the analysis from all states 
x 2 that are the last element of a pair kept in the previous step. In step 3 we store 
triples of states, say, (xi,x 2 ,X3) and recursively compute their likelihood and 
the corresponding derivatives by multiplication with T(x 2 ,X3) and iu(x3), i.e., 
for the likelihood we compute 



Note that we may reuse some of the entries of T since they already have been 
calculated in a previous step. In step 4 we again reduce the number of triples 
(xi,x 2 ,X3) by sorting them according to their likelihood. We then keep the 
most probable triples, and so on. Note that in step 4 we cannot use a fixed 
truncation threshold 5 to reduce the number of state sequences (or paths) since 
their probabilities may become very small as the sequences become longer. 

6. We stop the prolongation of paths xi , . . . , x^ when the time instance tji = At-R 
is reached and compute an approximation of C and its derivatives by summing 
up the corresponding values of all paths (cf. Eq. (|8)). 

If we have more than one observation sequence, i.e., K > 1, then we repeat the pro- 
cedure to compute Ck for all k and use ( fT2l to calculate the total log-likelihood. Note 
that the contribution of each path xi, . . . , x^ to Ck may be different for each k. It is, 
however, likely that the entries of T can be reused not only during the computation of 
each single Ck but also for different values of k. If many entries of T are reused dur- 
ing the computation, the algorithm performs fast compared to other approaches. For 
our experimental results in Section [6] we keep the ten most probable paths in step 4. 
Even though this enforces a coarse approximation, the likelihood is approximated very 
accurately if a is small, since in this case only few paths contribute significantly to Ck- 
On the other hand, if a is large, then the approximation may become inaccurate de- 
pending on the chosen truncation strategy. Another disadvantage of the PLA method is 
that for non-equidistant time series, the performance is slow since we have to compute 
(parts of) different transition matrices and, during the computation of Ck, the transition 
probabilities cannot be reused. 

6 Experimental results 

In this section we present experimental results of the SLA and PLA method. For 
equidistant time series, we compare our approach to the approximate maximum like- 



a(xi,x 2 ,x 3 ) 
^-a(xi,x 2 ,x 3 ) 



a(xi,x 2 ) • T(x 2 ,x 3 ) • iy(x 3 ) 
^-a(xi , x 2 ) • T(x 2 , x 3 ) • iu(x 3 ) 
+a(x x ,x 2 ) • ^-T(x 2 ,x 3 ) • w(x 3 ). 
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lihood (AML) and the singular value decomposition (SVDL) method described by 
Reinker et al. lfl"5l (compare also Section |7). Since an implementation of the AML 
and SVDL method was not available to us, we chose the same examples and experi- 
mental conditions for the time series as Reinker et al. and compared our results to those 
listed in the results section in 1131 . We also consider non-equidistant time series. To 
the best of our knowledge there exists no direct numerical approach for non-equidistant 
time series with measurement error that is based on the maximum likelihood method. 

We generated time series data for two different examples from systems biology 
using Monte-Carlo simulation [6,| and added error terms ei{ti) to the population of the 
i-th species at time tf . Besides the simple network described in Example[T]we consider 
a more complex network with eight reactions and five species for the transcription 
regulation of a repressor protein [TT31 : 



mRNA 
M 

DNA.D 
mRNA 



mRNA + M 



mRNA + DNA.D 



DNA + D 
DNA.D 
M + M 
D 



DNA.D 

DNA+D 

D 

M + M 



The initial molecular populations are (2, 4, 2, 0, 0) forM, D, DNA, mRNA, and DNA.D. 
The reachable state space of the model is infinite in three dimensions since the popula- 
tions of mRNA, M, and D are unbounded. The rate constants are c = (0.043, 0.0007, 
0.0715, 0.00395, 0.02, 0.4791 , 0.083, 0.5). For the network in Example[T]we chose the 
same parameters as Reinker et al., namely c = (0.0270, 0.1667, 0.40). 

For the generation of time series data we fix the (true) constants c and the standard 
deviation a of the error terms. We use the SLA and PLA method to estimate c and 
a such that the likelihood of the time series becomes maximal under these parame- 
ters. Since in practice only few observation sequences are available, we estimate the 
parameters based on K = 5 observation sequences. As suggested by Reinker et al., 
we repeat the generation of batches of five observation sequences and the estimation 
of parameters 100 times to approximate the mean and the standard deviation of the 
estimators. 

Our algorithms for the approximation of the likelihood are implemented in C++ 
and we run them on an Intel Core i7 at 2.8 Ghz with 8 GB main memory. They are 
linked to MATLAB's optimization toolbox which we use to minimize the negative log- 
likelihood. Since we use a global optimization method (MATLAB's global search), 
the running time of our method depends on the tightness of the intervals that we use 
as constraints for the unknown parameters as well as on the number of starting points 
of the global search procedure. We chose intervals that correspond to the order of 
magnitude of the parameters, i.e., if Cj G O(10") for some n £ Z then we use the 
interval [10" _1 , 10™ +1 ] as constraint for Cj. E.g. if Cj =0.1 then n = —1 and we use 
the interval [10~ 2 , 10°]. Moreover, for global search we used 20 starting points for the 
gene expression example and 5 for the transcription regulation example. Note that this 
is the only difference of our experimental conditions compared to Reinker et al. who 
use a local optimization method and start the optimization with the true parameters. 

In both algorithms we choose a significance threshold of S = 10~ 15 . Since the 
PLA method becomes slow if the number of paths that are considered is large, in step 
4 of the algorithm we reduce the number of paths that we consider by keeping only 
the 10 most probable paths. In this way, the computational effort of the PLA method 
remains tractable even in the case of the transcription regulation network. 



11 



Table 2: Estimates for the simple gene expression model using equidistant time series. 
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6.1 Equidistant time series 

In the equidistant case, the length of the observation intervals is At = t( — t^_i for all 
£ € {1, . . . , R}. In Tableland [3] we list the results given in Ifl31 as well as the results 
of our methods. Reinker et al. do not evaluate the AML method for larger intervals 
than At = 1 because, as we will discuss in Section [7] the approximation error of the 
AML method becomes huge in that case. Also, the SVDL method performs poor if a 
is large since it does not include measurement errors in the likelihood. Therefore, no 
results for a > 1.0 are provided in lfl"5l for SVDL. In the first three columns we list 
At, the number R of observation points and the true standard deviation a of the error 
terms. In column "Time" we compare the average running time (in seconds) of one 
parameter estimation (out of 100) for SLA and PLA, i.e., the average running time of 
the maximization of the likelihood based on K = 5 observation sequences. It is not 
meaningful to compare the running times with those in lfl5l since different optimization 
methods are used and experiments were run on different machines. Finally, we list 
estimation results for all four methods (if available). We give the true parameters in the 
column headings and list the average of 100 estimations and the standard deviation of 
the estimates (in brackets). 

For the simple gene expression (Table |2]i and At = 1.0, we find that SLA and PLA 
have a similar accuracy for the estimation of a but are consistently more accurate than 
AML and SVDL for estimating the rate constants. If a = 0.1, then the total absolute 
error for the estimation of c is 0.041, 0.073, 0.016, 0.018 for AML, SVDL, SLA, PLA, 
respectively. For a = 1.0 we have total absolute errors of 0.081, 0.053, 0.026, 0.021 
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for AML, SVDL, SLA, PLA. Finally, for a = 3.0, AML has a total error of 0. 1 39 while 
the error for SLA and PLA is 0.017 and 0.041. For At = 10, the results of the SLA 
and PLA method are accurate even though only 30 observation points are given. Since 
PLA gives a much coarser approximation, its running time is always shorter (about 
three to ten times shorter). If a is large, SLA gives more accurate results than PLA. 

In Table [3] we compare results of the transcription regulation for a = 0. Note 
that, for this example, Reinker et al. only give results for the SVDL method with 
At < 1.0 and a = 0. Here, we compare results for At = 1.0 since in this case the 
SVDL method performs best compared to smaller values of At. The SLA and PLA 
method consistently perform better than the SVDL method since they approximate the 
likelihood more accurately. If a = 0, then the accuracy of SLA and PLA is the same 
(up to the fifth digit). Therefore the results of SLA and PLA are combined in Table [3] 
The running time of SLA is, however, much slower since it does not reuse the entries 
of the transition probability matrix T. For At = 1.0, one parameter estimation based 
on K = 5 observations takes about 30 minutes for SLA and only about 2.4 minutes 
for PLA. For At = 10.0 we have running times about 5 hours(SLA) and 27 minutes 
(PLA). As for the gene expression example, we expect for larger values of a the results 
of SLA to be more accurate than those of PLA. 

6.2 Non-equidistant time series 

Finally, we consider non-equidistant time series, which can only be handled by the 
SLA method. During the Monte-Carlo simulation, we generate non-equidistant time 
series by iteratively choosing tg + i = t( + U(0, 5), where U(0, 5) is a random number 
that is uniformly distributed on (0, 5) and to — 0. Note that the intervals are not 
only different within an observation sequence but also for different k, i.e., the times 
t\, . . . , ti? depend on the number k of the corresponding sequence. We consider the 
transcription regulation model with a = 1.0 and K = 5 as this is our most complex 
example. Note that, since the accuracy of the estimation decreases as a increases, 
we cannot expect a similar accuracy as in Table [3] For a time horizon of t = 500 
the average number of observation points per sequence is R = 500/2.5 = 200. The 
estimates computed by SLA are c\ = 0.0384(0.0343), c* 2 = 0.0010(0.0001), c* 3 = 
0.0642(0.0249), c\ = 0.0044(0.0047), c* 5 = 0.0273(0.0073), c% = 0.5498(0.1992), 
= 0.0890(0.0154), c% = 0.5586(0.0716), and a* = 0.9510(0.0211), where we 
averaged over 100 repeated estimations and give the standard deviation in brackets. 
Recall that the true constants are c x = 0.043, c 2 = 0.0007, c 3 = 0.0715, c 4 = 
0.00395, c 5 = 0.02, c 6 = 0.4791, c 7 = 0.083, and c 8 = 0.5. The average running 
time of one estimation was 19 minutes. 

7 Related work 

In the context of stochastic chemical kinetics, parameter inference methods are either 
based on Bayesian inference EJHSJEO) or maximum likelihood estimation |[T5l[T9l[T7l . 
The advantage of the latter method is that the corresponding estimators are, in a sense, 
the most informative estimates of unknown parameters ifTUI and have desirable math- 
ematical properties such as unbiasedness, efficiency, and normality ffTTl . On the other 
hand, the computational complexity of maximum likelihood estimation is high. If an 
analytic solution of (O is not possible, then, as a part of the nonlinear optimization 
problem, the likelihood and its derivatives have to be calculated. Monte-Carlo simu- 
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Table 3: Estimates for the transcription regulation model using equidistant time series. 



At (R) 


Method 


Average (standard deviation) of parameter estimates 
ci = 0.043 c 2 = 0.0007 c 3 = 0.0715 


c 4 = 0.00395 


1.0 (500) 


SVDL 
PLA/SLA 


0.0477(0.0155 ) 0.0006(0.0004) 0.0645(0.0190) 
0.0447(0.0036) 0.0007(0.0001) 0.0677(0.0115) 


0.0110(0.0195) 
0.0034(0.0014) 
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0.0417(0.0069) 0.0005(0.0002) 0.0680(0.0075) 
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Method 


Average (standard deviation) of parameter estimates 
c 5 = 0.02 c 6 = 0.4791 c 7 = 0.083 


c 8 = 0.5 


1.0 (500) 


SVDL 
PLA/SLA 


0.0159(0.0107) 0.2646(0.0761) 0.0149(0.0143) 
0.0193(0.0008) 0.4592(0.0169) 0.0848(0.0024) 


0.0615(0.0332) 
0.5140(0.0166) 


10.0 (50) 


PLA/SLA 


0.0188(0.0039) 0.4359(0.0822) 0.0836(0.0016) 


0.4892(0.0164) 



lation has been used to estimate the likelihood lfT71[T9] . During the repeated random 
sampling it is difficult to explore those parts of the state space that are unlikely under 
the current rate parameters. Thus, especially if the rates are very different from the 
true parameters, many simulation runs are necessary to calculate an accurate approxi- 
mation of the likelihood. To the best of our knowledge, Reinker et al. provide the first 
maximum likelihood estimation that is not based on Monte-Carlo simulation but cal- 
culates the likelihood numerically ITT31 . They propose the AML method during which 
the matrices Pg are approximated. In order to keep the computational effort low, they 
allow at most two jumps of the Markov process during [tg, tg+i). Moreover, they ig- 
nore all states for which \Oi(tg) — xu\ is greater than Syfcr. This has the disadvantage 
that C is zero (and its derivative) if the values for the rate constants are far off the true 
values. If C is zero, then the derivatives provide no information about how the rate 
constants have to be altered in order to increase the likelihood. Thus, initially very 
good estimates for the rate constants must be known to apply this kind of truncation. 
On the other hand, the method that we propose neglects only insignificant terms of the 
likelihood. For this reason the likelihood and its derivatives do not become zero dur- 
ing the computation and it is always possible to follow the gradient in order to obtain 
higher likelihoods. Another disadvantage of the AML method is that, if the observa- 
tion intervals are longer, the likelihood may not be approximated accurately since the 
assumption that only two reactions occur within an observation interval is not valid. 
Extending the AML approach to more than two steps would result in huge space re- 
quirements and perform slow since the state space is explored in a breath-first search 
manner and too many states would be considered even though their contribution to the 
likelihood is very small. In our approach we allow an arbitrary number of reactions 
during [tg, tg + i]^ Therefore, our method is not restricted to reaction networks where 
the speed of all reactions is at most of the same time scale as the observation intervals. 
The second approach proposed by Reinker et al., called SVDL method, is based on the 
assumption that the propensities <x, stay constant during [tg, tg + \). Again, this assump- 
tion only applies to small observation intervals. Moreover, the SVDL method does not 
take into account measurement errors and is thus only appropriate if a is very small. 
Further differences between the approach of Reinker et al. and our approach are that 
we use a global optimization technique (MATLAB's global search) while Reinker et 
al. use a local solver, namely the quasi-Newton method. Finally, the approach in lfL5l 

2 During one step of our numerical integration, we assume that only four reactions are possible. The time 
step h of the numerical integration does, however, not depend on the [tg, t^+i) but is dynamically chosen in 
such a way that performing more than four steps is very unlikely. 
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requires observations at equidistant time instances, which is not necessary for the SLA 
method. 

8 Conclusion 

Parameter inference for stochastic models of cellular processes demands huge compu- 
tational resources. We proposed two numerical methods, called SLA and PLA, that 
approximate maximum likelihood estimators for a given set of observations. Both 
methods do not make any assumptions about the number of reactions that occur within 
an observation interval. The SLA method allows for an estimation based on arbitrarily 
spaced intervals while the PLA method requires equidistant intervals. 

Many reaction networks involve both small populations and large populations. In 
this case stochastic hybrid models are most appropriate since they combine the advan- 
tages of deterministic and stochastic representations. We plan to extend our algorithms 
to the stochastic hybrid setting proposed in [8] to allow inference for more complex 
networks. Further future work also includes more rigorous truncations for the SLA 
method and the parallelization of the algorithm. 
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